{
  "cells": [
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "%matplotlib inline"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "\nDeploy a Framework-prequantized Model with TVM\n==============================================\n**Author**: `Masahiro Masuda <https://github.com/masahi>`_\n\nThis is a tutorial on loading models quantized by deep learning frameworks into TVM.\nPre-quantized model import is one of the quantization support we have in TVM. More details on\nthe quantization story in TVM can be found\n`here <https://discuss.tvm.apache.org/t/quantization-story/3920>`_.\n\nHere, we demonstrate how to load and run models quantized by PyTorch, MXNet, and TFLite.\nOnce loaded, we can run compiled, quantized models on any hardware TVM supports.\n\n"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "First, necessary imports\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "from PIL import Image\n\nimport numpy as np\n\nimport torch\nfrom torchvision.models.quantization import mobilenet as qmobilenet\n\nimport tvm\nfrom tvm import relay\nfrom tvm.contrib.download import download_testdata"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Helper functions to run the demo\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "def get_transform():\n    import torchvision.transforms as transforms\n\n    normalize = transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])\n    return transforms.Compose(\n        [\n            transforms.Resize(256),\n            transforms.CenterCrop(224),\n            transforms.ToTensor(),\n            normalize,\n        ]\n    )\n\n\ndef get_real_image(im_height, im_width):\n    img_url = \"https://github.com/dmlc/mxnet.js/blob/main/data/cat.png?raw=true\"\n    img_path = download_testdata(img_url, \"cat.png\", module=\"data\")\n    return Image.open(img_path).resize((im_height, im_width))\n\n\ndef get_imagenet_input():\n    im = get_real_image(224, 224)\n    preprocess = get_transform()\n    pt_tensor = preprocess(im)\n    return np.expand_dims(pt_tensor.numpy(), 0)\n\n\ndef get_synset():\n    synset_url = \"\".join(\n        [\n            \"https://gist.githubusercontent.com/zhreshold/\",\n            \"4d0b62f3d01426887599d4f7ede23ee5/raw/\",\n            \"596b27d23537e5a1b5751d2b0481ef172f58b539/\",\n            \"imagenet1000_clsid_to_human.txt\",\n        ]\n    )\n    synset_name = \"imagenet1000_clsid_to_human.txt\"\n    synset_path = download_testdata(synset_url, synset_name, module=\"data\")\n    with open(synset_path) as f:\n        return eval(f.read())\n\n\ndef run_tvm_model(mod, params, input_name, inp, target=\"llvm\"):\n    with tvm.transform.PassContext(opt_level=3):\n        lib = relay.build(mod, target=target, params=params)\n\n    runtime = tvm.contrib.graph_executor.GraphModule(lib[\"default\"](tvm.device(target, 0)))\n\n    runtime.set_input(input_name, inp)\n    runtime.run()\n    return runtime.get_output(0).numpy(), runtime"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "A mapping from label to class name, to verify that the outputs from models below\nare reasonable\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "synset = get_synset()"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Everyone's favorite cat image for demonstration\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "inp = get_imagenet_input()"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Deploy a quantized PyTorch Model\n--------------------------------\nFirst, we demonstrate how to load deep learning models quantized by PyTorch,\nusing our PyTorch frontend.\n\nPlease refer to the PyTorch static quantization tutorial below to learn about\ntheir quantization workflow.\nhttps://pytorch.org/tutorials/advanced/static_quantization_tutorial.html\n\nWe use this function to quantize PyTorch models.\nIn short, this function takes a floating point model and converts it to uint8.\nThe model is per-channel quantized.\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "def quantize_model(model, inp):\n    model.fuse_model()\n    model.qconfig = torch.quantization.get_default_qconfig(\"fbgemm\")\n    torch.quantization.prepare(model, inplace=True)\n    # Dummy calibration\n    model(inp)\n    torch.quantization.convert(model, inplace=True)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Load quantization-ready, pretrained Mobilenet v2 model from torchvision\n-----------------------------------------------------------------------\nWe choose mobilenet v2 because this model was trained with quantization aware\ntraining. Other models require a full post training calibration.\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "qmodel = qmobilenet.mobilenet_v2(pretrained=True).eval()"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Quantize, trace and run the PyTorch Mobilenet v2 model\n------------------------------------------------------\nThe details are out of scope for this tutorial. Please refer to the tutorials\non the PyTorch website to learn about quantization and jit.\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "pt_inp = torch.from_numpy(inp)\nquantize_model(qmodel, pt_inp)\nscript_module = torch.jit.trace(qmodel, pt_inp).eval()\n\nwith torch.no_grad():\n    pt_result = script_module(pt_inp).numpy()"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Convert quantized Mobilenet v2 to Relay-QNN using the PyTorch frontend\n----------------------------------------------------------------------\nThe PyTorch frontend has support for converting a quantized PyTorch model to\nan equivalent Relay module enriched with quantization-aware operators.\nWe call this representation Relay QNN dialect.\n\nYou can print the output from the frontend to see how quantized models are\nrepresented.\n\nYou would see operators specific to quantization such as\nqnn.quantize, qnn.dequantize, qnn.requantize, and qnn.conv2d etc.\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "input_name = \"input\"  # the input name can be be arbitrary for PyTorch frontend.\ninput_shapes = [(input_name, (1, 3, 224, 224))]\nmod, params = relay.frontend.from_pytorch(script_module, input_shapes)\n# print(mod) # comment in to see the QNN IR dump"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Compile and run the Relay module\n--------------------------------\nOnce we obtained the quantized Relay module, the rest of the workflow\nis the same as running floating point models. Please refer to other\ntutorials for more details.\n\nUnder the hood, quantization specific operators are lowered to a sequence of\nstandard Relay operators before compilation.\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "tvm_result, rt_mod = run_tvm_model(mod, params, input_name, inp, target=\"llvm\")"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Compare the output labels\n-------------------------\nWe should see identical labels printed.\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "pt_top3_labels = np.argsort(pt_result[0])[::-1][:3]\ntvm_top3_labels = np.argsort(tvm_result[0])[::-1][:3]\n\nprint(\"PyTorch top3 labels:\", [synset[label] for label in pt_top3_labels])\nprint(\"TVM top3 labels:\", [synset[label] for label in tvm_top3_labels])"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "However, due to the difference in numerics, in general the raw floating point\noutputs are not expected to be identical. Here, we print how many floating point\noutput values are identical out of 1000 outputs from mobilenet v2.\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "print(\"%d in 1000 raw floating outputs identical.\" % np.sum(tvm_result[0] == pt_result[0]))"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Measure performance\n-------------------------\nHere we give an example of how to measure performance of TVM compiled models.\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "n_repeat = 100  # should be bigger to make the measurement more accurate\ndev = tvm.cpu(0)\nprint(rt_mod.benchmark(dev, number=1, repeat=n_repeat))"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "<div class=\"alert alert-info\"><h4>Note</h4><p>We recommend this method for the following reasons:\n\n   * Measurements are done in C++, so there is no Python overhead\n   * It includes several warm up runs\n   * The same method can be used to profile on remote devices (android etc.).</p></div>\n\n"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "<div class=\"alert alert-info\"><h4>Note</h4><p>Unless the hardware has special support for fast 8 bit instructions, quantized models are\n  not expected to be any faster than FP32 models. Without fast 8 bit instructions, TVM does\n  quantized convolution in 16 bit, even if the model itself is 8 bit.\n\n  For x86, the best performance can be achieved on CPUs with AVX512 instructions set.\n  In this case, TVM utilizes the fastest available 8 bit instructions for the given target.\n  This includes support for the VNNI 8 bit dot product instruction (CascadeLake or newer).\n\n  Moreover, the following general tips for CPU performance equally applies:\n\n   * Set the environment variable TVM_NUM_THREADS to the number of physical cores\n   * Choose the best target for your hardware, such as \"llvm -mcpu=skylake-avx512\" or\n     \"llvm -mcpu=cascadelake\" (more CPUs with AVX512 would come in the future)</p></div>\n\n"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Deploy a quantized MXNet Model\n------------------------------\nTODO\n\n"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Deploy a quantized TFLite Model\n-------------------------------\nTODO\n\n"
      ]
    }
  ],
  "metadata": {
    "kernelspec": {
      "display_name": "Python 3",
      "language": "python",
      "name": "python3"
    },
    "language_info": {
      "codemirror_mode": {
        "name": "ipython",
        "version": 3
      },
      "file_extension": ".py",
      "mimetype": "text/x-python",
      "name": "python",
      "nbconvert_exporter": "python",
      "pygments_lexer": "ipython3",
      "version": "3.6.9"
    }
  },
  "nbformat": 4,
  "nbformat_minor": 0
}